-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] table sample #52600
base: main
Are you sure you want to change the base?
[Feature] table sample #52600
Conversation
64435a6
to
46bfe29
Compare
acf9bfe
to
bc37039
Compare
@@ -259,9 +261,11 @@ class ColumnIterator { | |||
|
|||
virtual Status null_count(size_t* count) { return Status::OK(); }; | |||
|
|||
// RAW interface, should be used carefully | |||
virtual ColumnReader* get_column_reader() { return nullptr; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest to
virtual ColumnReader* get_column_reader()=0;
or use std::optional<ColumnReader*> as a return type to tell sb that the return type must be checked
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add an check in it, actually the unsupported ColumnIterator
should not be called.
std::bernoulli_distribution dist(_probability_percent / 100.0); | ||
|
||
size_t sampled_blocks = 0; | ||
size_t total_blocks = _total_rows / _rows_per_block; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how _rows_per_block is deduced? it seems _rows_per_block is deduced from rows per block
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's retried from Segment::num_rows_per_block
, which is persisted in segment footer.
and also, it can be configured through config.default_num_rows_per_column_file_block
, whose default value is 1024.
for (size_t i = 0; i < _total_rows; i += _rows_per_block) { | ||
if (dist(mt)) { | ||
sampled_blocks++; | ||
sampled_ranges.add(RowIdRange(i, std::min(i + _rows_per_block, _total_rows))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- can RowIdRange span two blocks ? a interesting question is what's the math expect of number of RowIdRanges spanning two block when stride(_rows_per_block) varies. is there an optimized stride?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i would say not, each RowIdRange is expected to cover exactly one block
|
||
double SortableZoneMap::width(const Datum& lhs, const Datum& rhs) { | ||
if (lhs.is_null() || rhs.is_null()) { | ||
return 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is 0, it seems that is a open range, it seens that infinity or maximum value is reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it depends.
here the width is used to construct the histogram, the null value cannot be put into the histogram, so i just ignore them via set the width to 0
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
Signed-off-by: Murphy <[email protected]>
bc37039
to
a958475
Compare
Signed-off-by: Murphy <[email protected]>
5eb8220
to
fffcb42
Compare
Quality Gate failedFailed conditions |
[Java-Extensions Incremental Coverage Report]✅ pass : 0 / 0 (0%) |
[BE Incremental Coverage Report]❌ fail : 166 / 324 (51.23%) file detail
|
[FE Incremental Coverage Report]❌ fail : 15 / 85 (17.65%) file detail
|
Why I'm doing:
What I'm doing:
Introduce a syntax for table sample:
Properties:
seed
: optional, specify the random seed, which can guarantee the deterministic resultmethod
: optional,by_block
orby_page
percent
:(0, 100)
, sample probabilityFixes #52787
What type of PR is this:
Does this PR entail a change in behavior?
If yes, please specify the type of change:
Checklist:
Bugfix cherry-pick branch check: